Question 1

a.)

True, if the response variable has correlation 0 with all the predictor variables, then the fitted model reduces to the intercept alone, with every slope equal to 0. In the simple linear regression case, this is a horizontal line through the data at height \(\bar{Y}\).

b.)

True, even if predictor variables are perfectly correlated, the model can still be a good fit for the data. Thinking geometrically, \(\operatorname{rank}(\mathbf{X}) < p\) under perfect multicollinearity, so the column space of \(\mathbf{X}\) has reduced dimension; but since that space still exists, the projection of \(Y\) onto it can still fit the data well.

c.)

True, since \(\hat{Y} = \hat{Y}^*\), the sums of squares remain the same, so the coefficient of multiple determination is unaffected.
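A quick check in symbols:

\[ R^2 = 1 - \frac{SSE}{SSTO} = 1 - \frac{\sum_i (Y_i - \hat{Y}_i)^2}{\sum_i (Y_i - \bar{Y})^2} \]

Since \(\hat{Y}_i = \hat{Y}_i^*\) and the observed \(Y_i\) are unchanged, every term is identical, so \(R^2\) is unchanged.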

d.)

True, since the \(p-1\) individual t-tests are not equivalent to testing whether there is a regression relation between \(Y\) and the set of \(X\) variables (as tested by the F test). Under multicollinearity the two can disagree.

e.)

True, say for example we have a variable \(X_1\) and add \(X_2\), which is (nearly) perfectly correlated with \(X_1\). The SSR remains essentially unchanged, but the MSR becomes smaller because the regression degrees of freedom increase, and the variances of the individual coefficient estimates are inflated. This can produce a significant overall F test even though none of the individual t-tests is significant.
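A rough sketch of the mean-square argument (assuming \(X_2\) is nearly, rather than exactly, a copy of \(X_1\), so the model is still estimable):

\[ MSR(X_1, X_2) = \frac{SSR(X_1, X_2)}{2} \approx \frac{SSR(X_1)}{2}, \]

so the overall \(F\) statistic is roughly halved, while the variances of the individual coefficient estimates are greatly inflated.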

f.)

True, since both the error variance and the variance of the corresponding least squares estimator would be \(\sigma^2\).

g.)

True, \(\hat{\beta}^*\) is equal to \(\frac{s_y}{s_{x_k}} r_{yx_k}\), which is not influenced by the correlations among the other \(X\) variables. It measures how much of the variation in \(Y\) can be explained by \(X_k\).

h.)

True, since the inflated variance of \(\hat{\beta}^*_k\) is caused by the intercorrelation between \(X_k\) and the rest of the \(X\) variables.

Question 5

a.)

library(ggplot2)
library(GGally)
library(plotly)
property <- read.table("property.txt")
colnames(property) <-
  c("Ren.Rate", "Age", "Exp", "Vac.Rate", "Sq.Foot")
property
model1 <- lm(Ren.Rate ~ Age + Exp + Sq.Foot +  Vac.Rate, data = property)
model1
## 
## Call:
## lm(formula = Ren.Rate ~ Age + Exp + Sq.Foot + Vac.Rate, data = property)
## 
## Coefficients:
## (Intercept)          Age          Exp      Sq.Foot     Vac.Rate  
##   1.220e+01   -1.420e-01    2.820e-01    7.924e-06    6.193e-01

b.)

summary(model1)
## 
## Call:
## lm(formula = Ren.Rate ~ Age + Exp + Sq.Foot + Vac.Rate, data = property)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1872 -0.5911 -0.0910  0.5579  2.9441 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.220e+01  5.780e-01  21.110  < 2e-16 ***
## Age         -1.420e-01  2.134e-02  -6.655 3.89e-09 ***
## Exp          2.820e-01  6.317e-02   4.464 2.75e-05 ***
## Sq.Foot      7.924e-06  1.385e-06   5.722 1.98e-07 ***
## Vac.Rate     6.193e-01  1.087e+00   0.570     0.57    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.137 on 76 degrees of freedom
## Multiple R-squared:  0.5847, Adjusted R-squared:  0.5629 
## F-statistic: 26.76 on 4 and 76 DF,  p-value: 7.272e-14
anova(model1)
# We find the coefficient of partial determination
R_3 <- 0.420 / (98.231 + 0.420)
R_3
## [1] 0.004257433
# partial correlation
r_3 <- sqrt(R_3)
r_3
## [1] 0.06524901

Here \(R^2_{Y3|124}\) is the proportional reduction in SSE when we add the variable \(X_3\) to a model already containing \(X_1\), \(X_2\), and \(X_4\), which here is only about 0.4%.
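Using the extra-sum-of-squares decomposition, the computation above is:

\[ R^2_{Y3|124} = \frac{SSR(X_3 \mid X_1, X_2, X_4)}{SSE(X_1, X_2, X_4)} = \frac{0.420}{98.231 + 0.420} \approx 0.0043, \]

since \(SSE(X_1, X_2, X_4) = SSE(X_1, X_2, X_3, X_4) + SSR(X_3 \mid X_1, X_2, X_4)\).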

c.)

addX3 <- lm(Vac.Rate ~ Age + Exp + Sq.Foot, data = property)
dropX3 <- lm(Ren.Rate ~ Age + Exp + Sq.Foot, data = property)

# Regressing the Y residuals on the X3 residuals
fit1 <- lm(dropX3$residuals ~ addX3$residuals)

ggplot() +
  aes(x = addX3$residuals, y = dropX3$residuals) +
  geom_point() +
  labs(x = "e(X3|X1,X2,X4)", y = "e(Y|X1,X2,X4)",
       title = "Added Variable Plot for X3") +
  geom_abline(intercept = fit1$coefficients[1],
              slope = fit1$coefficients[2])

The added variable plot shows that \(X_3\) provides very little additional help in explaining \(Y\) when \(X_1\), \(X_2\), and \(X_4\) are already in the model.

d.)

model2 <- lm(Ren.Rate ~ Age + Exp + Sq.Foot, data = property)
fit2 <- lm(model2$residuals ~ addX3$residuals)
summary(fit2)
## 
## Call:
## lm(formula = model2$residuals ~ addX3$residuals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1872 -0.5911 -0.0910  0.5579  2.9441 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)     -1.039e-16  1.239e-01   0.000    1.000
## addX3$residuals  6.193e-01  1.066e+00   0.581    0.563
## 
## Residual standard error: 1.115 on 79 degrees of freedom
## Multiple R-squared:  0.004255,   Adjusted R-squared:  -0.008349 
## F-statistic: 0.3376 on 1 and 79 DF,  p-value: 0.5629

We can see that the fitted regression slope (0.6193) is the same as the regression coefficient of \(X_3\) from part b.
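This is a general property of least squares (the Frisch–Waugh–Lovell theorem): regressing the residuals \(e(Y \mid X_1, X_2, X_4)\) on \(e(X_3 \mid X_1, X_2, X_4)\) reproduces the multiple-regression coefficient of \(X_3\):

\[ \hat{\beta}_3 = \frac{\sum_i e_i(Y \mid X_1, X_2, X_4)\, e_i(X_3 \mid X_1, X_2, X_4)}{\sum_i e_i(X_3 \mid X_1, X_2, X_4)^2} = 0.6193 \]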

e.)

# From part d 
anova(fit2)
anova(model1)

We can see that the \(SSR\) from the residual regression equals \(SSR(X_3 \mid X_1, X_2, X_4) = 0.420\). Along with that, both models have exactly the same \(SSE\).

f.)

# We find the correlation between the residuals of the models 
r <- cor(model2$residuals, addX3$residuals)
r
## [1] 0.06522951
r^2
## [1] 0.004254889

We can see that the correlation coefficient \(r\) for the residuals equals the partial correlation coefficient \(r_{Y3|124}\), with \(r^2 = 0.00425\).
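Equivalently, the partial correlation is just the simple correlation between the two sets of residuals:

\[ r_{Y3|124} = \operatorname{cor}\!\big(e(Y \mid X_1, X_2, X_4),\; e(X_3 \mid X_1, X_2, X_4)\big) \approx 0.0652 \]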

g.)

Yresid <- lm(property$Ren.Rate ~ addX3$residuals)
summary(Yresid)
## 
## Call:
## lm(formula = property$Ren.Rate ~ addX3$residuals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7641 -1.1392 -0.1056  1.1221  4.1630 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      15.1389     0.1921  78.807   <2e-16 ***
## addX3$residuals   0.6193     1.6528   0.375    0.709    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.729 on 79 degrees of freedom
## Multiple R-squared:  0.001774,   Adjusted R-squared:  -0.01086 
## F-statistic: 0.1404 on 1 and 79 DF,  p-value: 0.7089

We can see that the fitted regression coefficient from regressing \(Y\) on the residuals is the same as the fitted regression coefficient of \(X_3\) from part b. This is because the residuals \(e(X_3 \mid X_1, X_2, X_4)\) contain exactly the part of \(X_3\) not explained by \(X_1\), \(X_2\), and \(X_4\), which is the same information that determines the coefficient of \(X_3\) in the full model.

Question 6

a.)

# Let's clean up the data and return it to its original form
property <- read.table("property.txt")
colnames(property) <-
  c("Ren.Rate", "Age", "Exp", "Vac.Rate", "Sq.Foot")
property
#Means of each column
colMeans(property)
##     Ren.Rate          Age          Exp     Vac.Rate      Sq.Foot 
## 1.513889e+01 7.864198e+00 9.688148e+00 8.098765e-02 1.606333e+05
# SD, we use sapply function
sapply(property, sd)
##     Ren.Rate          Age          Exp     Vac.Rate      Sq.Foot 
## 1.719584e+00 6.632784e+00 2.583169e+00 1.345512e-01 1.090990e+05
# Standardize the data
property[,2:5] <- scale(property[,2:5])

#Means of each column standardized
round(colMeans(property),4)
## Ren.Rate      Age      Exp Vac.Rate  Sq.Foot 
##  15.1389   0.0000   0.0000   0.0000   0.0000
# SDs of each standardized column
sapply(property, sd)
## Ren.Rate      Age      Exp Vac.Rate  Sq.Foot 
## 1.719584 1.000000 1.000000 1.000000 1.000000

b.)

modelSTD <- lm(Ren.Rate ~ Age + Exp + Sq.Foot +  Vac.Rate, data = property)
modelSTD
## 
## Call:
## lm(formula = Ren.Rate ~ Age + Exp + Sq.Foot + Vac.Rate, data = property)
## 
## Coefficients:
## (Intercept)          Age          Exp      Sq.Foot     Vac.Rate  
##    15.13889     -0.94208      0.72850      0.86453      0.08333

\[ \hat{Y}_i = 15.14 - 0.94X_1 + 0.73X_2 + 0.08X_3 + 0.86X_4 \]

The fitted regression intercept for the standardized model is 15.13889.
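This intercept equals \(\bar{Y}\): with centered predictors, the usual intercept formula collapses to the response mean,

\[ \hat{\beta}_0 = \bar{Y} - \sum_k \hat{\beta}_k \bar{X}_k = \bar{Y} = 15.13889, \]

since \(\bar{X}_k = 0\) for each standardized predictor.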

c.)

# ANOVA from standardized model
anova(modelSTD)
# ANOVA from model1
anova(model1)

The standardized model and the original Model 1 have exactly the same ANOVA table, with \(SSE = 98.231\), \(SSR = 138.328\), and \(SSTO = 236.559\).

d.)

summary(modelSTD)
## 
## Call:
## lm(formula = Ren.Rate ~ Age + Exp + Sq.Foot + Vac.Rate, data = property)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1872 -0.5911 -0.0910  0.5579  2.9441 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15.13889    0.12632 119.845  < 2e-16 ***
## Age         -0.94208    0.14156  -6.655 3.89e-09 ***
## Exp          0.72850    0.16318   4.464 2.75e-05 ***
## Sq.Foot      0.86453    0.15108   5.722 1.98e-07 ***
## Vac.Rate     0.08333    0.14623   0.570     0.57    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.137 on 76 degrees of freedom
## Multiple R-squared:  0.5847, Adjusted R-squared:  0.5629 
## F-statistic: 26.76 on 4 and 76 DF,  p-value: 7.272e-14

The \(R^2\) and \(R^2_a\) are the same in the standardized model as they are in the original model, \(0.5847\) and \(0.5629\) respectively.

Question 7

a.)

rxx1 <- solve(cor(property[,2:5]))
rxx1
##                 Age        Exp   Vac.Rate    Sq.Foot
## Age       1.2403482 -0.2870567  0.2244927 -0.2495354
## Exp      -0.2870567  1.6482246  0.6092380 -0.6926391
## Vac.Rate  0.2244927  0.6092380  1.3235525 -0.4399669
## Sq.Foot  -0.2495354 -0.6926391 -0.4399669  1.4127219
# Variance inflation factors are the diagonal values
diag(rxx1)
##      Age      Exp Vac.Rate  Sq.Foot 
## 1.240348 1.648225 1.323552 1.412722
sumX1 <- summary(lm(Age ~ Exp + Vac.Rate + Sq.Foot, data = property))
1 / (1 - sumX1$r.squared)
## [1] 1.240348
sumX2 <- summary(lm(Exp ~ Age + Vac.Rate + Sq.Foot, data = property))
1 / (1 - sumX2$r.squared)
## [1] 1.648225
sumX3 <- summary(lm(Vac.Rate ~ Exp +  Age + Sq.Foot, data = property))
1 / (1 - sumX3$r.squared)
## [1] 1.323552
sumX4 <- summary(lm(Sq.Foot ~ Vac.Rate+ Exp +  Age, data = property))
1 / (1 - sumX4$r.squared)
## [1] 1.412722

We can see that the degree of multicollinearity in this data is nothing to be concerned about: there is no high correlation between any pair of variables, and none of the variance inflation factors comes close to the common cutoff of 10.
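The two computations above agree because the diagonal entries of the inverse correlation matrix are exactly the variance inflation factors:

\[ VIF_k = \big[ r_{XX}^{-1} \big]_{kk} = \frac{1}{1 - R_k^2}, \]

where \(R_k^2\) is the coefficient of determination from regressing \(X_k\) on the other predictor variables.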

b.)

YX4model <- lm(Ren.Rate ~ Sq.Foot, data = property)
YX3X4model <- lm(Ren.Rate ~ Vac.Rate + Sq.Foot, data = property)
summary(YX4model)
## 
## Call:
## lm(formula = Ren.Rate ~ Sq.Foot, data = property)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1390 -0.7930  0.2890  0.9653  3.4415 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15.1389     0.1624  93.215  < 2e-16 ***
## Sq.Foot       0.9204     0.1634   5.632 2.63e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.462 on 79 degrees of freedom
## Multiple R-squared:  0.2865, Adjusted R-squared:  0.2775 
## F-statistic: 31.72 on 1 and 79 DF,  p-value: 2.628e-07
summary(YX3X4model)
## 
## Call:
## lm(formula = Ren.Rate ~ Vac.Rate + Sq.Foot, data = property)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1886 -0.7879  0.3140  0.9820  3.4021 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15.13889    0.16338  92.659  < 2e-16 ***
## Vac.Rate     0.04046    0.16494   0.245    0.807    
## Sq.Foot      0.91717    0.16494   5.561 3.63e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.47 on 78 degrees of freedom
## Multiple R-squared:  0.2871, Adjusted R-squared:  0.2688 
## F-statistic:  15.7 on 2 and 78 DF,  p-value: 1.859e-06

We find that the regression coefficients for \(X_4\) are very close to each other, going from \(0.9204\) to \(0.91717\).

anova(YX4model)
anova(YX3X4model)

The \(SSR(X_4) = 67.775\) and the \(SSR(X_4|X_3) = 66.858\): \(SSR(X_4)\) is slightly larger than \(SSR(X_4 \mid X_3)\). The small decrease occurs because \(X_3\) accounts for a small share of the same variation in \(Y\) that \(X_4\) explains, so the marginal contribution of \(X_4\) shrinks slightly once \(X_3\) is already in the model.
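The extra sum of squares identity behind this comparison is:

\[ SSR(X_4 \mid X_3) = SSR(X_3, X_4) - SSR(X_3) = SSE(X_3) - SSE(X_3, X_4), \]

so whenever \(X_3\) explains some of the same variation in \(Y\) as \(X_4\), the marginal contribution \(SSR(X_4 \mid X_3)\) falls below \(SSR(X_4)\).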

c.)

YX2model <- lm(Ren.Rate ~ Exp, data = property)
YX1X2model <- lm(Ren.Rate ~ Age + Exp, data = property)
summary(YX2model)
## 
## Call:
## lm(formula = Ren.Rate ~ Exp, data = property)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5733 -0.9093 -0.1559  0.8290  4.7607 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15.1389     0.1750   86.49  < 2e-16 ***
## Exp           0.7115     0.1761    4.04 0.000123 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.575 on 79 degrees of freedom
## Multiple R-squared:  0.1712, Adjusted R-squared:  0.1607 
## F-statistic: 16.32 on 1 and 79 DF,  p-value: 0.0001231
summary(YX1X2model)
## 
## Call:
## lm(formula = Ren.Rate ~ Age + Exp, data = property)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6473 -0.9041 -0.1574  0.4652  4.0687 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15.1389     0.1535  98.601  < 2e-16 ***
## Age          -0.8330     0.1677  -4.967 3.92e-06 ***
## Exp           1.0354     0.1677   6.175 2.79e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.382 on 78 degrees of freedom
## Multiple R-squared:  0.3704, Adjusted R-squared:  0.3543 
## F-statistic: 22.94 on 2 and 78 DF,  p-value: 1.457e-08

We can see that the estimated regression coefficient for \(X_2\) increases from \(0.7115\) to \(1.0354\).

anova(YX2model)
anova(YX1X2model)

The \(SSR(X_2) = 40.503\) and the \(SSR(X_2|X_1) = 72.802\): \(SSR(X_2)\) is much smaller than \(SSR(X_2 \mid X_1)\). The increase occurs because adding \(X_1\) sharply reduces the SSE, so the marginal contribution of \(X_2\) is much larger once \(X_1\) is already in the model.

Question 8

a.)

ggplotly(ggplot(data = property, aes(x = Age, y = Ren.Rate)) + geom_point())

We can see from the plot that there is no sign of a clear linear relationship between the age of a property and its rental rate.

b.)

We have the model equation: \[ Y_i = \beta_0 + \beta_1\tilde{X}_{i1} + \beta_{11}\tilde{X}_{i1}^2 + \beta_2X_{i2} + \beta_4X_{i4} + \varepsilon_i \] where \(\tilde{X}_{i1} = X_{i1} - \bar{X}_1\) is the centered age variable.

property["AgeCent"] <- property$Age - mean(property$Age)
property["AgeSq"] <- property$AgeCent ^ 2

polyModel <-
  lm(Ren.Rate ~ AgeCent + AgeSq + Exp + Sq.Foot, data = property)
summary(polyModel)
## 
## Call:
## lm(formula = Ren.Rate ~ AgeCent + AgeSq + Exp + Sq.Foot, data = property)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.89596 -0.62547 -0.08907  0.62793  2.68309 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  14.5242     0.2808  51.729  < 2e-16 ***
## AgeCent      -1.2057     0.1692  -7.125 5.10e-10 ***
## AgeSq         0.6224     0.2561   2.431   0.0174 *  
## Exp           0.8112     0.1519   5.340 9.33e-07 ***
## Sq.Foot       0.8778     0.1382   6.351 1.42e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.097 on 76 degrees of freedom
## Multiple R-squared:  0.6131, Adjusted R-squared:  0.5927 
## F-statistic:  30.1 on 4 and 76 DF,  p-value: 5.203e-15
# Plotting observations against fitted values
ggplotly(
  ggplot() +
    aes(x = polyModel$fitted.values, y = property$Ren.Rate) +
    geom_point() +
    labs(x = "Fitted Values", y = "Observations",
         title = "Observations against Fitted Values")
)

We have the fitted regression function (with centered age and standardized predictors): \[ \hat{Y}_i = 14.52 - 1.21\tilde{X}_{i1} + 0.62\tilde{X}_{i1}^2 + 0.81X_{i2} + 0.88X_{i4} \] We find that our model is a good fit: the \(R^2_{adj}\) is reasonably high, and the plot of observations against fitted values is fairly linear.

c.)

# Model 2
model2 <- lm(Ren.Rate ~ Age + Exp + Sq.Foot, data = property)
summary(model2)
## 
## Call:
## lm(formula = Ren.Rate ~ Age + Exp + Sq.Foot, data = property)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0620 -0.6437 -0.1013  0.5672  2.9583 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15.1389     0.1258 120.374  < 2e-16 ***
## Age          -0.9562     0.1388  -6.891 1.33e-09 ***
## Exp           0.6901     0.1480   4.663 1.29e-05 ***
## Sq.Foot       0.8922     0.1424   6.265 1.97e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.132 on 77 degrees of freedom
## Multiple R-squared:  0.583,  Adjusted R-squared:  0.5667 
## F-statistic: 35.88 on 3 and 77 DF,  p-value: 1.295e-14

We find that both the \(R^2\) and \(R^2_{adj}\) are higher in the quadratic model than the Model 2. The \(R^2\) for Model 2 is \(0.583\) and \(0.6131\) for the quadratic model. The \(R^2_{adj}\) for Model 2 is \(0.5667\) and \(0.5927\) for the quadratic model. This would lead us to conclude that the quadratic model is a better fit than Model 2.

d.)

To test our full model versus our reduced model, we have: \[ H_0: \beta_j = 0 \ \text{for all} \ j \in \mathbf{J} \qquad H_a: \beta_j \neq 0 \ \text{for some} \ j \in \mathbf{J} \] with test statistic and null distribution: \[ F^* = \frac{\dfrac{SSE(R) - SSE(F)}{df_R - df_F}}{\dfrac{SSE(F)}{df_F}}, \qquad F^* \sim F_{(df_R - df_F,\ df_F)} \ \text{under} \ H_0 \] Here \(\mathbf{J}\) contains only the quadratic coefficient, so \(df_R - df_F = 77 - 76 = 1\) and \(df_F = 76\).

We reject \(H_0\) if \(F^* > F_{(1-\alpha;\ df_R - df_F,\ df_F)}\).

# Find critical value; the full model has 76 df, the reduced model 77
qf(1 - 0.05, 77 - 76, 76)
anova(polyModel, model2)

Given that \(F^* = 5.9078\) exceeds the critical value, we reject \(H_0\) and conclude that the quadratic term is significant in the model at \(\alpha = 0.05\).

e.)

# The models were fit on standardized predictors, so the new observation
# (Age = 4, Exp = 10, Sq.Foot = 80000 in original units) must first be
# standardized using the training means and standard deviations
newAge <- (4 - 7.864198) / 6.632784
newExp <- (10 - 9.688148) / 2.583169
newSqFt <- (80000 - 160633.3) / 109099.0

# Our prediction for Model 2
predict(model2,
        data.frame(Age = newAge, Exp = newExp, Sq.Foot = newSqFt),
        interval = "prediction", level = 0.99)

# Our prediction for the quadratic model (AgeCent equals the standardized
# Age here, since the standardized column already has mean 0)
predict(polyModel,
        data.frame(AgeCent = newAge, AgeSq = newAge^2,
                   Exp = newExp, Sq.Foot = newSqFt),
        interval = "prediction", level = 0.99)

We can see that predicting with the quadratic model yields a slightly lower prediction interval than predicting with Model 2.